September 3, 2015

IS606 R Package

A fix has been pushed to GitHub that addresses the error when calling startLab. You will need to reinstall the package to get the fix.

devtools::install_github('jbryer/IS606')

Recall that startLab copies the lab files to the current working directory by default:

getwd()

To start the lab, call the startLab function:

IS606::startLab('Lab1')

Meetup Presentations

  1. Jay Narhan (1.69)

Intro to Data

In this class we will use the lego R package, which contains information about every Lego set manufactured from 1970 to 2014, a total of 5710 sets.

devtools::install_github("seankross/lego")
library(lego)
data(legosets)

Types of Variables

  • Numerical (quantitative)
    • Continuous
    • Discrete
  • Categorical (qualitative)
    • Regular categorical
    • Ordinal
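
As a rough sketch, these variable types map onto R vector classes as shown below. The values are made up for illustration and are not taken from the lego data; the variable names are hypothetical.

```r
# Hypothetical values, for illustration only
price  <- c(9.99, 24.99, 149.99)                  # continuous numerical
pieces <- c(29L, 121L, 1518L)                     # discrete numerical (counts)
theme  <- factor(c("City", "Star Wars", "City"))  # regular categorical
size   <- factor(c("small", "medium", "large"),
                 levels = c("small", "medium", "large"),
                 ordered = TRUE)                  # ordinal: the levels have an order

class(price)       # "numeric"
class(pieces)      # "integer"
class(theme)       # "factor"
is.ordered(size)   # TRUE
size[1] < size[3]  # TRUE: comparisons are meaningful for ordered factors
```

Note that str() reports categorical columns stored as text as chr; converting them with factor() is one way to make the categorical nature explicit.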

Types of Variables

str(legosets)
## Classes 'tbl_df', 'tbl' and 'data.frame':    5710 obs. of  14 variables:
##  $ Item_Number : chr  "10241" "10242" "10243" "10244" ...
##  $ Name        : chr  "Maersk Line Triple-E" "Mini Cooper MK VII" "Parisian Restaurant" "Fairground Mixer" ...
##  $ Year        : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
##  $ Theme       : chr  "Advanced Models" "Advanced Models" "Advanced Models" "Advanced Models" ...
##  $ Subtheme    : chr  "Maersk" "Vehicles" "Modular Buildings" "Miscellaneous" ...
##  $ Pieces      : int  1518 1077 2469 1746 9 12 29 121 39 14 ...
##  $ Minifigures : int  NA NA 5 12 1 NA NA 2 NA 1 ...
##  $ Image_URL   : chr  "http://www.1000steine.com/brickset/images/10241-1.jpg" "http://www.1000steine.com/brickset/images/10242-1.jpg" "http://www.1000steine.com/brickset/images/10243-1.jpg" "http://www.1000steine.com/brickset/images/10244-1.jpg" ...
##  $ GBP_MSRP    : num  109.99 74.99 132.99 119.99 5.99 ...
##  $ USD_MSRP    : num  149.99 99.99 159.99 149.99 6.99 ...
##  $ CAD_MSRP    : num  179.99 119.99 189.99 179.99 8.99 ...
##  $ EUR_MSRP    : num  129.99 89.99 149.99 129.99 6.99 ...
##  $ Packaging   : chr  "Box" "Box" "Box" "Box" ...
##  $ Availability: chr  "LEGO exclusive" "LEGO exclusive" "LEGO exclusive" "LEGO exclusive" ...

Qualitative Variables

Descriptive statistics:

  • Contingency Tables
  • Proportional Tables

Plot types:

  • Bar plot
  • Mosaic plot

Contingency Tables

table(legosets$Availability, useNA='ifany')
## 
##        LEGO exclusive    LEGOLAND exclusive         Not specified 
##                   662                     1                  1792 
##           Promotional Promotional (Airline)                Retail 
##                   143                    12                  2753 
##      Retail - limited               Unknown 
##                   343                     4
table(legosets$Availability, legosets$Packaging, useNA='ifany')
##                        
##                         Blister pack  Box Box with backing card Bucket
##   LEGO exclusive                  37  133                     0      1
##   LEGOLAND exclusive               0    1                     0      0
##   Not specified                    0   19                     0      0
##   Promotional                      5   40                     0      0
##   Promotional (Airline)            0   11                     0      0
##   Retail                          53 2302                    10     30
##   Retail - limited                 2  261                     1      4
##   Unknown                          0    1                     0      0
##                        
##                         Canister Foil pack Loose Parts Not specified Other
##   LEGO exclusive               0         0          68             5     3
##   LEGOLAND exclusive           0         0           0             0     0
##   Not specified                0         5           0          1742     0
##   Promotional                  0         0           4             2     3
##   Promotional (Airline)        0         0           0             1     0
##   Retail                      78       203           0             0    23
##   Retail - limited             0         1           0             1     0
##   Unknown                      0         0           0             0     0
##                        
##                         Plastic box Polybag Shrink-wrapped  Tag  Tub
##   LEGO exclusive                  1     408              0    6    0
##   LEGOLAND exclusive              0       0              0    0    0
##   Not specified                   6      19              0    0    1
##   Promotional                     1      87              0    0    1
##   Promotional (Airline)           0       0              0    0    0
##   Retail                          0       4             18    0   32
##   Retail - limited                1      67              0    0    5
##   Unknown                         0       3              0    0    0

Proportional Tables

prop.table(table(legosets$Availability))
## 
##        LEGO exclusive    LEGOLAND exclusive         Not specified 
##          0.1159369527          0.0001751313          0.3138353765 
##           Promotional Promotional (Airline)                Retail 
##          0.0250437828          0.0021015762          0.4821366025 
##      Retail - limited               Unknown 
##          0.0600700525          0.0007005254
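
For a two-way table, prop.table also accepts a margin argument, so proportions can be computed within rows or within columns instead of over the grand total. The sketch below uses a small made-up data frame (the name sets and its values are assumptions, not the lego data).

```r
# Hypothetical availability/packaging values, for illustration only
sets <- data.frame(
  Availability = c("Retail", "Retail", "Retail", "Promotional", "Promotional"),
  Packaging    = c("Box", "Box", "Polybag", "Polybag", "Polybag")
)
tab <- table(sets$Availability, sets$Packaging)

prop.table(tab)              # each cell divided by the grand total
prop.table(tab, margin = 1)  # row proportions: each row sums to 1
prop.table(tab, margin = 2)  # column proportions: each column sums to 1
```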

Bar Plots

barplot(table(legosets$Availability), las=3)

Bar Plots

barplot(prop.table(table(legosets$Availability)), las=3)

Mosaic Plot

library(vcd)
mosaic(HairEyeColor, shade=TRUE, legend=TRUE)

Quantitative Variables

Descriptive statistics:

  • Mean
  • Median
  • Quartiles
  • Variance: \(s^2=\sum_{i=1}^{n}\frac{(x_i-\bar{x})^2}{n-1}\)
  • Standard deviation: \(s=\sqrt{s^2}\)

Plot types:

  • Dot plots
  • Histograms
  • Density plots
  • Box plots
  • Scatterplots
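
The variance and standard deviation formulas above can be checked by hand against var() and sd(). This is a minimal sketch on a made-up vector x, not on the lego data.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)     # made-up data
n    <- length(x)                  # 8 observations
xbar <- mean(x)                    # 5

# Sum of squared deviations from the mean, divided by n - 1
s2 <- sum((x - xbar)^2) / (n - 1)  # 32 / 7
s  <- sqrt(s2)

all.equal(s2, var(x))              # TRUE: matches the built-in variance
all.equal(s, sd(x))                # TRUE: matches the built-in standard deviation
```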

Measures of Center

mean(legosets$Pieces, na.rm=TRUE)
## [1] 211.6356
median(legosets$Pieces, na.rm=TRUE)
## [1] 79

Measures of Spread

var(legosets$Pieces, na.rm=TRUE)
## [1] 127887.8
sqrt(var(legosets$Pieces, na.rm=TRUE))
## [1] 357.6141
sd(legosets$Pieces, na.rm=TRUE)
## [1] 357.6141


fivenum(legosets$Pieces, na.rm=TRUE)
## [1]    1   28   79  250 5922
IQR(legosets$Pieces, na.rm=TRUE)
## [1] 222

The summary Function

summary(legosets$Pieces)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     1.0    28.0    79.0   211.6   250.0  5922.0     101

The psych Package

library(psych)
describe(legosets$Pieces, skew=FALSE)
##   vars    n   mean     sd median trimmed   mad min  max range   se
## 1    1 5609 211.64 357.61     79   136.5 99.33   1 5922  5921 4.77
describeBy(legosets$Pieces, group = legosets$Availability, skew=FALSE, mat=TRUE)
##    item                group1 vars    n      mean        sd median
## 11    1        LEGO exclusive    1  626 184.68371 481.80713   42.0
## 12    2    LEGOLAND exclusive    1    1 320.00000        NA  320.0
## 13    3         Not specified    1 1740 154.30057 336.36346   44.5
## 14    4           Promotional    1  139  46.84173  96.32623   27.0
## 15    5 Promotional (Airline)    1   12 126.16667  47.01612  127.0
## 16    6                Retail    1 2745 243.70237 289.93108  137.0
## 17    7      Retail - limited    1  342 367.10819 598.42732  193.5
## 18    8               Unknown    1    4  27.50000  15.96872   30.0
##      trimmed      mad min  max range        se
## 11  60.94024  40.0302   1 4287  4286 19.256886
## 12 320.00000   0.0000 320  320     0        NA
## 13  79.90517  54.1149   1 5195  5194  8.063697
## 14  30.92035  17.7912   1 1000   999  8.170284
## 15 130.10000  22.9803  10  203   193 13.572384
## 16 188.78198 152.7078   1 3803  3802  5.533802
## 17 233.17153 227.5791   1 5922  5921 32.359243
## 18  27.50000  12.6021   6   44    38  7.984360

Robust Statistics

Median and IQR are more robust to skewness and outliers than mean and SD. Therefore,

  • for skewed distributions it is often more helpful to use median and IQR to describe the center and spread
  • for symmetric distributions it is often more helpful to use the mean and SD to describe the center and spread
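
To illustrate the robustness claim, the sketch below adds a single extreme value to a small made-up vector and compares how the four statistics respond. The vectors x and x_out are hypothetical.

```r
x     <- c(10, 12, 13, 15, 16, 18, 20)  # made-up, roughly symmetric values
x_out <- c(x, 500)                      # the same values plus one extreme value

# Mean and SD shift a lot; median and IQR barely move
c(mean = mean(x),     median = median(x),     sd = sd(x),     IQR = IQR(x))
c(mean = mean(x_out), median = median(x_out), sd = sd(x_out), IQR = IQR(x_out))
# The mean jumps from about 14.9 to 75.5, while the median goes from 15 to 15.5
```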

Dot Plot

stripchart(legosets$Pieces)

Dot Plot

par.orig <- par(mar=c(1,10,1,1))
stripchart(legosets$Pieces ~ legosets$Availability, las=1)

par(par.orig)

Histograms

hist(legosets$Pieces)

Transformations

With highly skewed distributions, it is often helpful to transform the data. The log transformation is a common approach, especially when dealing with salary or similar data.

hist(log(legosets$Pieces))
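
A small numeric sketch of why the log transformation helps, using made-up values. It uses log10 rather than the natural log only so the results are easy to read; the effect on the shape is the same. Note that the log is only defined for positive values.

```r
x <- c(1, 10, 100, 1000, 10000)  # made-up, strongly right-skewed values

mean(x)     # 2222.2, pulled far above the median by the largest value
median(x)   # 100

lx <- log10(x)  # 0 1 2 3 4: evenly spaced on the log scale
mean(lx)        # 2
median(lx)      # 2, so the mean and median now agree
```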

Density Plots

plot(density(legosets$Pieces, na.rm=TRUE), main='Lego Pieces per Set')

Density Plot (log transformed)

plot(density(log(legosets$Pieces), na.rm=TRUE), main='Lego Pieces per Set (log transformed)')

Box Plots

boxplot(legosets$Pieces)

boxplot(log(legosets$Pieces))

Scatter Plots

plot(legosets$Pieces, legosets$USD_MSRP)

Examining Possible Outliers (expensive sets)

legosets[which(legosets$USD_MSRP >= 400),]
##      Item_Number                                   Name Year        Theme
## 453      2000430             Identity and Landscape Kit 2013 Serious Play
## 454      2000431                        Connections Kit 2013 Serious Play
## 1618     2000409                 Window Exploration Bag 2010 Serious Play
## 2417       10179 Ultimate Collector's Millennium Falcon 2007    Star Wars
##                       Subtheme Pieces Minifigures
## 453                                NA          NA
## 454                              2455          NA
## 1618                             4900          NA
## 2417 Ultimate Collector Series   5195           5
##                                                    Image_URL GBP_MSRP
## 453  http://www.1000steine.com/brickset/images/2000430-1.jpg       NA
## 454  http://www.1000steine.com/brickset/images/2000431-1.jpg   490.18
## 1618 http://www.1000steine.com/brickset/images/2000409-1.jpg   314.99
## 2417   http://www.1000steine.com/brickset/images/10179-1.jpg   342.49
##      USD_MSRP CAD_MSRP EUR_MSRP     Packaging  Availability
## 453    789.99   789.99       NA Not specified Not specified
## 454    754.99   754.99       NA Not specified Not specified
## 1618   484.99   484.99       NA Not specified Not specified
## 2417   499.99       NA       NA Not specified Not specified

Examining Possible Outliers (big sets)

legosets[which(legosets$Pieces >= 4000),]
##      Item_Number                                   Name Year
## 1615       10214                           Tower Bridge 2010
## 1618     2000409                 Window Exploration Bag 2010
## 2194       10189                              Taj Mahal 2008
## 2417       10179 Ultimate Collector's Millennium Falcon 2007
##                Theme                  Subtheme Pieces Minifigures
## 1615 Advanced Models                 Buildings   4287          NA
## 1618    Serious Play                             4900          NA
## 2194 Advanced Models                 Buildings   5922          NA
## 2417       Star Wars Ultimate Collector Series   5195           5
##                                                    Image_URL GBP_MSRP
## 1615   http://www.1000steine.com/brickset/images/10214-1.jpg   209.99
## 1618 http://www.1000steine.com/brickset/images/2000409-1.jpg   314.99
## 2194   http://www.1000steine.com/brickset/images/10189-1.jpg   199.99
## 2417   http://www.1000steine.com/brickset/images/10179-1.jpg   342.49
##      USD_MSRP CAD_MSRP EUR_MSRP     Packaging     Availability
## 1615   239.99   299.99   219.99           Box   LEGO exclusive
## 1618   484.99   484.99       NA Not specified    Not specified
## 2194   299.99   399.99       NA           Box Retail - limited
## 2417   499.99       NA       NA Not specified    Not specified

plot(legosets$Pieces, legosets$USD_MSRP)
bigAndExpensive <- legosets[which(legosets$Pieces >= 4000 | legosets$USD_MSRP >= 400),]
text(bigAndExpensive$Pieces, bigAndExpensive$USD_MSRP, labels=bigAndExpensive$Name)

Gapminder

Likert Scales

Likert scales are a type of questionnaire where respondents are asked to rate items on scales usually ranging from four to seven levels (e.g. strongly disagree to strongly agree).

library(likert)
library(reshape)
data(pisaitems)
items24 <- pisaitems[,substr(names(pisaitems), 1,5) == 'ST24Q']
items24 <- rename(items24, c(
            ST24Q01="I read only if I have to.",
            ST24Q02="Reading is one of my favorite hobbies.",
            ST24Q03="I like talking about books with other people.",
            ST24Q04="I find it hard to finish books.",
            ST24Q05="I feel happy if I receive a book as a present.",
            ST24Q06="For me, reading is a waste of time.",
            ST24Q07="I enjoy going to a bookstore or a library.",
            ST24Q08="I read only to get information that I need.",
            ST24Q09="I cannot sit still and read for more than a few minutes.",
            ST24Q10="I like to express my opinions about books I have read.",
            ST24Q11="I like to exchange books with my friends."))

likert R Package

l24 <- likert(items24)
summary(l24)
##                                                        Item      low
## 10   I like to express my opinions about books I have read. 41.07516
## 5            I feel happy if I receive a book as a present. 46.93475
## 8               I read only to get information that I need. 50.39874
## 7                I enjoy going to a bookstore or a library. 51.21231
## 3             I like talking about books with other people. 54.99129
## 11                I like to exchange books with my friends. 55.54115
## 2                    Reading is one of my favorite hobbies. 56.64470
## 1                                 I read only if I have to. 58.72868
## 4                           I find it hard to finish books. 65.35125
## 9  I cannot sit still and read for more than a few minutes. 76.24524
## 6                       For me, reading is a waste of time. 82.88729
##    neutral     high     mean        sd
## 10       0 58.92484 2.604913 0.9009968
## 5        0 53.06525 2.466751 0.9446590
## 8        0 49.60126 2.484616 0.9089688
## 7        0 48.78769 2.428508 0.9164136
## 3        0 45.00871 2.328049 0.9090326
## 11       0 44.45885 2.343193 0.9609234
## 2        0 43.35530 2.344530 0.9277495
## 1        0 41.27132 2.291811 0.9369023
## 4        0 34.64875 2.178299 0.8991628
## 9        0 23.75476 1.974736 0.8793028
## 6        0 17.11271 1.810093 0.8611554

likert Plots

plot(l24)

likert Plots

plot(l24, type='heat')

likert Plots

plot(l24, type='density')

Pie Charts

There is only one pie chart in OpenIntro Statistics (Diez, Barr, & Çetinkaya-Rundel, 2015, p. 48). Consider the following three pie charts that represent the preference of five different colors. Is there a difference between the three pie charts? This is probably difficult to answer.

Source: https://en.wikipedia.org/wiki/Pie_chart.

Just say NO to pie charts!

"There is no data that can be displayed in a pie chart that cannot better be displayed in some other type of chart"

John Tukey

Sampling vs. Census

A census involves collecting data for the entire population of interest. This is problematic for several reasons, including:

  • It can be difficult to complete a census: there always seem to be some individuals who are hard to locate or hard to measure. And these difficult-to-find people may have certain characteristics that distinguish them from the rest of the population.
  • Populations rarely stand still. Even if you could take a census, the population changes constantly, so it’s never possible to get a perfect measure.
  • Taking a census may be more complex than sampling.

Sampling involves measuring a subset of the population of interest, usually one selected at random.

Sampling Bias

  • Non-response: If only a small fraction of the randomly sampled people choose to respond to a survey, the sample may no longer be representative of the population.
  • Voluntary response: Occurs when the sample consists of people who volunteer to respond because they have strong opinions on the issue. Such a sample will also not be representative of the population.
  • Convenience sample: Individuals who are easily accessible are more likely to be included in the sample.

Observational Studies vs. Experiments

  • Observational study: Researchers collect data in a way that does not directly interfere with how the data arise, i.e. they merely “observe”, and can only establish an association between the explanatory and response variables.
  • Experiment: Researchers randomly assign subjects to various treatments in order to establish causal connections between the explanatory and response variables.

Correlation

Source: XKCD 552 http://xkcd.com/552/

Correlation does not imply causation!

Simple Random Sampling

Randomly select cases from the population, where there is no implied connection between the points that are selected.
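
A minimal sketch of a simple random sample in base R, assuming a hypothetical population of 1000 case IDs. The seed is arbitrary and is set only so the draw is reproducible.

```r
set.seed(606)                         # arbitrary seed for reproducibility
population <- 1:1000                  # hypothetical population of case IDs

srs <- sample(population, size = 50)  # 50 cases drawn without replacement

length(srs)         # 50
anyDuplicated(srs)  # 0: no case is selected twice
```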

Stratified Sampling

Strata are made up of similar observations. We take a simple random sample from each stratum.
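
One way to sketch stratified sampling in base R is to split the population by stratum and draw a simple random sample within each piece. The data frame pop, its stratum labels, and the sample size of 10 per stratum are all assumptions made for illustration.

```r
set.seed(606)  # arbitrary seed for reproducibility

# Hypothetical population: 300 cases in three strata of unequal size
pop <- data.frame(id      = 1:300,
                  stratum = rep(c("A", "B", "C"), times = c(150, 100, 50)))

# Take a simple random sample of 10 cases from each stratum
strat_sample <- do.call(rbind, lapply(split(pop, pop$stratum), function(d) {
  d[sample(nrow(d), 10), ]
}))

table(strat_sample$stratum)  # 10 cases from each of A, B, and C
```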

Cluster Sampling

Clusters are usually not made up of homogeneous observations, so we take a simple random sample of clusters and then a random sample of observations within each selected cluster.
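
The two stages can be sketched in base R as follows. The population pop, the 20 clusters, and the stage sizes (5 clusters, then 8 cases per cluster) are hypothetical choices for illustration.

```r
set.seed(606)  # arbitrary seed for reproducibility

# Hypothetical population: 400 cases grouped into 20 clusters of 20
pop <- data.frame(id = 1:400, cluster = rep(1:20, each = 20))

# Stage 1: simple random sample of 5 clusters
chosen <- sample(unique(pop$cluster), 5)

# Stage 2: simple random sample of 8 cases within each chosen cluster
cluster_sample <- do.call(rbind, lapply(chosen, function(cl) {
  d <- pop[pop$cluster == cl, ]
  d[sample(nrow(d), 8), ]
}))

table(cluster_sample$cluster)  # 8 cases from each of the 5 chosen clusters
```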

Principles of experimental design

  1. Control: Compare treatment of interest to a control group.
  2. Randomize: Randomly assign subjects to treatments, and randomly sample from the population whenever possible.
  3. Replicate: Within a study, replicate by collecting a sufficiently large sample. Or replicate the entire study.
  4. Block: If there are variables that are known or suspected to affect the response variable, first group subjects into blocks based on these variables, and then randomize cases within each block to treatment groups.
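
Blocking followed by randomization can be sketched as below. The subjects data frame, the blocking variable risk, and the two group labels are hypothetical; within each block, a balanced set of labels is shuffled so that each block contributes equally to both groups.

```r
set.seed(606)  # arbitrary seed for reproducibility

# Hypothetical subjects with a blocking variable (risk level)
subjects <- data.frame(id   = 1:12,
                       risk = rep(c("low", "high"), each = 6))

# Randomize to treatment/control separately within each block
subjects$group <- NA_character_
for (b in unique(subjects$risk)) {
  in_block <- which(subjects$risk == b)
  labels   <- rep(c("treatment", "control"), length.out = length(in_block))
  subjects$group[in_block] <- sample(labels)  # shuffle labels within the block
}

table(subjects$risk, subjects$group)  # 3 control and 3 treatment in each block
```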

Difference between blocking and explanatory variables

  • Factors are conditions we can impose on the experimental units.
  • Blocking variables are characteristics that the experimental units come with, that we would like to control for.
  • Blocking is like stratifying, except used in experimental settings when randomly assigning, as opposed to when sampling.

More experimental design terminology…

  • Placebo: fake treatment, often used as the control group for medical studies
  • Placebo effect: experimental units showing improvement simply because they believe they are receiving a special treatment
  • Blinding: when experimental units do not know whether they are in the control or treatment group
  • Double-blind: when both the experimental units and the researchers who interact with the patients do not know who is in the control and who is in the treatment group

Random assignment vs. random sampling

ggplot2

  • ggplot2 is an R package that provides an alternative framework based upon Wilkinson’s (2005) Grammar of Graphics.
  • ggplot2 is, in general, more flexible for creating "prettier" and more complex plots.
  • Works by creating layers of different types of objects/geometries (e.g. bars, points, lines, polygons).
  • ggplot2 has at least three ways of creating plots:
    1. qplot
    2. ggplot(...) + geom_XXX(...) + ...
    3. ggplot(...) + layer(...)
  • We will focus only on the second.

First Example

library(ggplot2)
data(diamonds)
ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point()

Parts of a ggplot2 Statement

  • Data
    ggplot(myDataFrame, aes(x=x, y=y))
  • Layers
    geom_point(), geom_histogram()
  • Facets
    facet_wrap(~ cut), facet_grid(~ cut)
  • Scales
    scale_y_log10()
  • Other options
    ggtitle('my title'), ylim(c(0, 10000)), xlab('x-axis label')
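
Putting the parts together, a single statement might look like the sketch below. It uses the diamonds data that ships with ggplot2; the title, axis labels, and alpha value are arbitrary choices for illustration.

```r
library(ggplot2)
data(diamonds)

p <- ggplot(diamonds, aes(x = carat, y = price)) +  # data and aesthetic mapping
  geom_point(alpha = 0.2) +                         # layer: one point per diamond
  facet_wrap(~ cut) +                               # facets: one panel per cut
  scale_y_log10() +                                 # scale: log10 y-axis
  ggtitle('Diamond price by carat') +               # other options
  xlab('Carat') +
  ylab('Price (log scale)')

print(p)  # draws the plot
```

Because each part is added with +, the plot can be built up incrementally and stored in an object before printing.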

Lots of geoms

ls('package:ggplot2')[grep('geom_', ls('package:ggplot2'))]
##  [1] "geom_abline"          "geom_area"            "geom_bar"            
##  [4] "geom_bin2d"           "geom_blank"           "geom_boxplot"        
##  [7] "geom_contour"         "geom_crossbar"        "geom_density"        
## [10] "geom_density2d"       "geom_dotplot"         "geom_errorbar"       
## [13] "geom_errorbarh"       "geom_freqpoly"        "geom_hex"            
## [16] "geom_histogram"       "geom_hline"           "geom_jitter"         
## [19] "geom_line"            "geom_linerange"       "geom_map"            
## [22] "geom_path"            "geom_point"           "geom_pointrange"     
## [25] "geom_polygon"         "geom_quantile"        "geom_raster"         
## [28] "geom_rect"            "geom_ribbon"          "geom_rug"            
## [31] "geom_segment"         "geom_smooth"          "geom_step"           
## [34] "geom_text"            "geom_tile"            "geom_violin"         
## [37] "geom_vline"           "update_geom_defaults"

Scatterplot Revisited

ggplot(legosets, aes(x=Pieces, y=USD_MSRP)) + geom_point()

Scatterplot Revisited (cont.)

ggplot(legosets, aes(x=Pieces, y=USD_MSRP, color=Availability)) + geom_point()

Scatterplot Revisited (cont.)

ggplot(legosets, aes(x=Pieces, y=USD_MSRP, size=Minifigures, color=Availability)) + geom_point()

Scatterplot Revisited (cont.)

ggplot(legosets, aes(x=Pieces, y=USD_MSRP, size=Minifigures)) + geom_point() + facet_wrap(~ Availability)

Boxplots Revisited

ggplot(legosets, aes(x='Lego', y=USD_MSRP)) + geom_boxplot()

Boxplots Revisited (cont.)

ggplot(legosets, aes(x=Availability, y=USD_MSRP)) + geom_boxplot()

Boxplots Revisited (cont.)

ggplot(legosets, aes(x=Availability, y=USD_MSRP)) + geom_boxplot() + coord_flip()